
feat(embedder): use summary for file embedding in semantic pipeline#765

Merged
qin-ctx merged 3 commits into volcengine:main from yangxinxin-7:feat/use-summary-for-embedding
Mar 19, 2026

Conversation

@yangxinxin-7 (Collaborator) commented Mar 19, 2026

Summary

  • When files in a code repository are processed through the semantic pipeline, use the pre-generated summary (AST skeleton or LLM summary) for embedding instead of raw file content
  • Add is_code_repo flag to SemanticMsg and propagate it through the pipeline: ResourceProcessor → Summarizer → SemanticMsg → SemanticDagExecutor
  • Detect code repositories via source_format == "repository" (set by CodeRepositoryParser) and pass is_code_repo=True when enqueuing semantic processing
  • use_summary in _file_summary_task is now gated on is_code_repo, so plain text / markdown / other non-repo resources continue to embed raw file content
  • Truncate AST skeleton to max_skeleton_chars (12000 chars, ~3000 tokens) before embedding to prevent oversized input
  • Add max_skeleton_chars config field to SemanticConfig

Why

Raw file content was being sent directly to the embedding API even when a semantic summary had already been generated. For large files this caused the embedding API to reject the request with a token limit error (e.g. OpenAI 8192 token limit). Using the bounded summary instead of raw content fixes this.

However, using summary for all file types (including markdown, plain text) was incorrect — for those files the raw content is the meaningful representation. Summary-based embedding is only appropriate for code files where AST skeletons provide a better semantic signal.

Paths unaffected:

  • index_resource direct indexing path (use_summary defaults to False)
  • Memory files (handled separately in memory_extractor.py)
  • Non-repo resources (markdown, plain text, etc.) — always use raw content
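The detection step is the `source_format == "repository"` check set by CodeRepositoryParser. A hedged sketch of how the flag might be set at enqueue time (the function name `enqueue_semantic_processing` and the dict-based resource shape are illustrative assumptions, not the PR's API):

```python
def enqueue_semantic_processing(resource: dict, queue: list) -> None:
    """Hypothetical enqueue step: tag messages that came from a code repository."""
    # CodeRepositoryParser sets source_format == "repository" on parsed
    # resources; markdown, plain text, etc. get is_code_repo=False and
    # keep the old raw-content embedding behavior.
    is_code_repo = resource.get("source_format") == "repository"
    queue.append({"uri": resource["uri"], "is_code_repo": is_code_repo})
```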

Closes

Closes #616

When files are processed through the semantic pipeline (SemanticDag),
use the pre-generated summary (AST skeleton or LLM summary) for
embedding instead of reading raw file content. This ensures code files,
markdown, and other text files within a repository are indexed by their
semantic summary rather than truncated raw content.

- Add use_summary flag to VectorizeTask, _vectorize_single_file, and vectorize_file
- Set use_summary=True in _file_summary_task when a non-empty summary is available
- Truncate AST skeleton to max_skeleton_chars (12000 chars, ~3000 tokens) before embedding
- Add max_skeleton_chars config field to SemanticConfig
- index_resource and memory paths are unaffected (use_summary defaults to False)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

try:
    if need_vectorize:
        use_summary = bool(summary_dict.get("summary"))
Collaborator

Will only code go through this path?

Collaborator Author

Something seems a bit off here; let me check again.

Collaborator Author

> Will only code go through this path?

This has been updated: now the summary is only used in the code repo case.

yangxinxin-7 and others added 2 commits March 19, 2026 16:55
…ext/doc files

Add `is_code_repo` flag to `SemanticMsg` and propagate it through the
pipeline so that summary-based embedding (AST skeleton) is only applied
when processing a code repository (`source_format == "repository"`).
For plain text, markdown, and other non-repo resources, raw file content
is used for embedding as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@qin-ctx qin-ctx merged commit 59352f8 into volcengine:main Mar 19, 2026
5 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 19, 2026
Successfully merging this pull request may close these issues.

[Bug]: add-resource sends oversized input to OpenAI embeddings API during repo import
